Lock that uses congestion detection for self-tuning #93879

VSadov · 2023-10-23T16:11:52Z

This is basically the same as #87672 with some additions

The main difference is that TryEnterSlow uses the same algorithm as in NativeAOT lock, so that congestion detection could be used for self-tuning.

The essence of the algorithm is:

when we try to acquire the lock on the slow path loop, we also observe whether the lock is changing its ownership.
if, while we were trying to acquire, the lock was owned by 2 other threads, we consider that as a signal that there is too much competition and dial down the spin limit.
shorter spin limit makes unsuccessful spinners to sleep earlier, which statistically reduces the overall number of concurrent spinners and decreases competition.
conversely, if successful spinning gets close to the limit and no congestion is seen, the limit is increased.
every thread that acquires the lock via the slow path gets to cast a vote on how the acquiring went (too crowded, could allow more spin, nothing special)

The goal of this scheme is to scale better in cases of heavy concurrent use of the lock (i.e. > 4-8 threads).

dotnet-issue-labeler · 2023-10-23T16:12:00Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2023-10-23T16:12:08Z

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

This is basically the same as #87672

The difference is that TryEnterSlow uses the same algorithm as in NativeAOT lock, so that congestion detection could be used for self-tuning.

Author:	VSadov
Assignees:	VSadov
Labels:	`new-api-needs-documentation`, `area-NativeAOT-coreclr`
Milestone:	-

ghost · 2023-10-23T18:12:19Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

This is basically the same as #87672 with some additions

The difference is that TryEnterSlow uses the same algorithm as in NativeAOT lock, so that congestion detection could be used for self-tuning.

The essence of the algorithm is:

when we try to acquire the lock on the slow path loop, we also observe whether the lock is changing its ownership.
if, while we were trying to acquire, the lock was passed between 2 other threads, we consider that as a signal that there is too much competition and dial down the spin limit.
shorter spin limit makes unsuccessful spinners to sleep earlier, which statistically reduces the overall number of concurrent spinners and decreases competition.
conversely, if successful spinning gets close to the limit and no congestion is seen, the limit is increased.
every thread that acquires the lock via the slow path gets to cast a vote on how the acquiring went (too crowded, could allow more spin, nothing special)

The goal of this scheme is to scale better in cases of heavy concurrent use of the lock (i.e. > 4-8 threads).

Author:	VSadov
Assignees:	VSadov
Labels:	`area-System.Threading`, `new-api-needs-documentation`, `area-NativeAOT-coreclr`
Milestone:	-

VSadov · 2023-10-23T20:12:04Z

Here is an example of using this Lock in a short-held scenario.
"Short-held" here means that threads spend relatively little time inside the lock compared to the time outside of the lock.

Scenarios like 50/50 inside/outside can also be measured, but less interesting as such scenario on its own has little room to scale beyond 2 threads. (3 threads can't all spend 50% of their total time inside the same lock)

The code:

using System.Diagnostics;

namespace ConsoleApp12
{
    internal class Program
    {
        private const int iters = 10000000;

        static void Main(string[] args)
        {
            for (; ; )
            {
                for (int i = 0; i < 7; i++)
                {
                    int thrCount = 1 << i;
                    System.Console.WriteLine("Threads:" + thrCount);

                    for (int j = 0; j < 4; j++)
                    {
                        System.Console.Write("Fat Lock: ");
                        RunMany(() => FatLock(thrCount), thrCount);
                    }
                }

                System.Console.WriteLine();
            }
        }

        static void RunMany(Action f, int threadCount)
        {
            Thread[] threads = new Thread[threadCount];
            bool start = false;

            for (int i = 0; i < threads.Length; i++)
            {
                threads[i] = new Thread(
                    () =>
                    {
                        while (!start) Thread.SpinWait(1);
                        f();
                    }
                );

                threads[i].Start();
            }

            Thread.Sleep(10);
            start = true;

            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < threads.Length; i++)
            {
                threads[i].Join();
            }

            System.Console.WriteLine("Ops per msec: " + iters / sw.ElapsedMilliseconds);
        }

        private static int Fib(int n) => n < 2 ? 1 : Fib(n - 1) + Fib(n - 2);
        private static int ComputeSomething(Random random)
        {
            int delay = random.Next(4, 10);
            return Fib(delay);
        }

        static System.Threading.Lock fatLock = new System.Threading.Lock();

        public static int sharedCounter = 0;
        public static Dictionary<int, int> sharedDictionary = new Dictionary<int, int>();

        static void FatLock(int thrCount)
        {
            Random random = new Random();
            for (int i = 0; i < iters / thrCount; i++)
            {
                // do some computation
                int value = ComputeSomething(random);
                var scope = fatLock.EnterScope();
                {
                    // update shared state
                    sharedCounter += value;
                    sharedDictionary[i] = sharedCounter;
                }
                scope.Dispose();
            }
        }
    }
}

Results on:

On Windows10 x64
AMD Ryzen 9 5950X 16-Core Processor
Logical processors: 32

Higher number is better.

=== Lock with congestion sensing (in this PR).

Threads:1
Fat Lock: Ops per msec: 22988
Fat Lock: Ops per msec: 23201
Fat Lock: Ops per msec: 23310
Fat Lock: Ops per msec: 23201
Threads:2
Fat Lock: Ops per msec: 20833
Fat Lock: Ops per msec: 20703
Fat Lock: Ops per msec: 20283
Fat Lock: Ops per msec: 20618
Threads:4
Fat Lock: Ops per msec: 16103
Fat Lock: Ops per msec: 16181
Fat Lock: Ops per msec: 16051
Fat Lock: Ops per msec: 16129
Threads:8
Fat Lock: Ops per msec: 16835
Fat Lock: Ops per msec: 16750
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16977
Threads:16
Fat Lock: Ops per msec: 16891
Fat Lock: Ops per msec: 17035
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16920
Threads:32
Fat Lock: Ops per msec: 16694
Fat Lock: Ops per msec: 16778
Fat Lock: Ops per msec: 16778
Fat Lock: Ops per msec: 16835
Threads:64
Fat Lock: Ops per msec: 16835
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16778

=== results for #87672

Threads:1
Fat Lock: Ops per msec: 23041
Fat Lock: Ops per msec: 23529
Fat Lock: Ops per msec: 23255
Fat Lock: Ops per msec: 23474
Threads:2
Fat Lock: Ops per msec: 20920
Fat Lock: Ops per msec: 20790
Fat Lock: Ops per msec: 20920
Fat Lock: Ops per msec: 20920
Threads:4
Fat Lock: Ops per msec: 11507
Fat Lock: Ops per msec: 11534
Fat Lock: Ops per msec: 11148
Fat Lock: Ops per msec: 11025
Threads:8
Fat Lock: Ops per msec: 9199
Fat Lock: Ops per msec: 9225
Fat Lock: Ops per msec: 9174
Fat Lock: Ops per msec: 9208
Threads:16
Fat Lock: Ops per msec: 9208
Fat Lock: Ops per msec: 9157
Fat Lock: Ops per msec: 9157
Fat Lock: Ops per msec: 9157
Threads:32
Fat Lock: Ops per msec: 9165
Fat Lock: Ops per msec: 9149
Fat Lock: Ops per msec: 9140
Fat Lock: Ops per msec: 9165
Threads:64
Fat Lock: Ops per msec: 9115
Fat Lock: Ops per msec: 9140
Fat Lock: Ops per msec: 9099
Fat Lock: Ops per msec: 9090

VSadov · 2023-10-23T20:20:35Z

Here is the results for the same benchmark as above, but with 2 locks. Sometimes a program has more than one lock.

It is the same driver code, just this part is different:

. . . 
. . . 
        static System.Threading.Lock fatLock1 = new System.Threading.Lock();
        static System.Threading.Lock fatLock2 = new System.Threading.Lock();

        public static int sharedCounter1 = 0;
        public static int sharedCounter2 = 0;

        public static Dictionary<int, int> sharedDictionary1 = new Dictionary<int, int>();
        public static Dictionary<int, int> sharedDictionary2 = new Dictionary<int, int>();

        static void FatLock(int thrCount)
        {
            Random random = new Random();
            for (int i = 0; i < iters / thrCount; i++)
            {
                // do some computation
                int value = ComputeSomething(random);

                if (i % 2 == 0)
                {
                    var scope = fatLock1.EnterScope();
                    {
                        // update shared state
                        sharedCounter1 += value;
                        sharedDictionary1[i] = sharedCounter1;
                    }
                    scope.Dispose();

                }
                else
                {
                    var scope = fatLock2.EnterScope();
                    {
                        // update shared state
                        sharedCounter2 += value;
                        sharedDictionary2[i] = sharedCounter2;
                    }
                    scope.Dispose();
                }
            }
        }

=== Congestion-sensing (this PR)

Threads:1
Fat Lock: Ops per msec: 22471
Fat Lock: Ops per msec: 22779
Fat Lock: Ops per msec: 22779
Fat Lock: Ops per msec: 22779
Threads:2
Fat Lock: Ops per msec: 23640
Fat Lock: Ops per msec: 23640
Fat Lock: Ops per msec: 23584
Fat Lock: Ops per msec: 23529
Threads:4
Fat Lock: Ops per msec: 17241
Fat Lock: Ops per msec: 17152
Fat Lock: Ops per msec: 17182
Fat Lock: Ops per msec: 17182
Threads:8
Fat Lock: Ops per msec: 15015
Fat Lock: Ops per msec: 14792
Fat Lock: Ops per msec: 15015
Fat Lock: Ops per msec: 15037
Threads:16
Fat Lock: Ops per msec: 14347
Fat Lock: Ops per msec: 14347
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14245
Threads:32
Fat Lock: Ops per msec: 13869
Fat Lock: Ops per msec: 13888
Fat Lock: Ops per msec: 13831
Fat Lock: Ops per msec: 14025
Threads:64
Fat Lock: Ops per msec: 13736
Fat Lock: Ops per msec: 13605
Fat Lock: Ops per msec: 13661
Fat Lock: Ops per msec: 13736

=== Base PR

Threads:1
Fat Lock: Ops per msec: 22471
Fat Lock: Ops per msec: 22321
Fat Lock: Ops per msec: 22421
Fat Lock: Ops per msec: 22371
Threads:2
Fat Lock: Ops per msec: 27700
Fat Lock: Ops per msec: 27548
Fat Lock: Ops per msec: 27173
Fat Lock: Ops per msec: 27548
Threads:4
Fat Lock: Ops per msec: 14598
Fat Lock: Ops per msec: 14577
Fat Lock: Ops per msec: 14534
Fat Lock: Ops per msec: 14556
Threads:8
Fat Lock: Ops per msec: 7806
Fat Lock: Ops per msec: 7782
Fat Lock: Ops per msec: 7733
Fat Lock: Ops per msec: 7824
Threads:16
Fat Lock: Ops per msec: 4844
Fat Lock: Ops per msec: 4847
Fat Lock: Ops per msec: 4821
Fat Lock: Ops per msec: 4840
Threads:32
Fat Lock: Ops per msec: 4791
Fat Lock: Ops per msec: 4793
Fat Lock: Ops per msec: 4789
Fat Lock: Ops per msec: 4844
Threads:64
Fat Lock: Ops per msec: 4759
Fat Lock: Ops per msec: 4761
Fat Lock: Ops per msec: 4761
Fat Lock: Ops per msec: 4752

VSadov · 2023-10-23T20:46:27Z

In the last example, the base PR ends up with 3X worse throughput than the congestion-sensing approach.

The reason is excessive spinning in a lock that is heavily contested. When threads can't acquire the lock in one shot, they will try acquiring again in a loop, which would succeed, eventually - at a great cost, since spinning on highly contested state is expensive. (many cache misses, unnecessary transfers of the cache line ownership between cores, possibly some thermal effects, ...).
The implementation uses the "success" as a signal to allow even more spinning...

Excessive spinning also takes resources that could be used by other threads not involved in this lock, even if that is another lock under similar conditions. Thus two locks end up behaving much worse than just one.

While the system still makes progress, it is more fruitful for some of the contestants in busy locks to leave the "traffic jam" and take a nap, while the rest of the contestants can do the same work, but with less overhead.

NOTE: this should not sound as simply "spinning is bad". Spinning is good when it is cheap. It is the expensive spinning that should be avoided.

kouvel · 2023-10-24T00:37:13Z

This PR is premature. Please see the plenty of information shared in #87672 and take it into consideration before raising PRs.

kouvel · 2023-10-26T21:10:28Z

Let's go ahead and reopen this PR, I should probably let you manage your own PRs, and we're continuing to discuss on next steps.

VSadov · 2023-10-27T17:36:12Z

/azp run runtime-extra-platforms

azure-pipelines · 2023-10-27T17:36:35Z

Azure Pipelines successfully started running 1 pipeline(s).

VSadov · 2023-11-03T15:35:25Z

/azp run runtime-extra-platforms

azure-pipelines · 2023-11-03T15:35:49Z

Azure Pipelines successfully started running 1 pipeline(s).

VSadov · 2023-11-23T23:11:20Z

/azp run runtime-nativeaot-outerloop

azure-pipelines · 2023-11-23T23:11:29Z

Azure Pipelines successfully started running 1 pipeline(s).

VSadov · 2023-12-18T20:05:19Z

/azp run runtime-nativeaot-outerloop

azure-pipelines · 2023-12-18T20:05:35Z

Azure Pipelines successfully started running 1 pipeline(s).

ghost · 2024-01-17T23:02:12Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

VSadov · 2024-04-19T16:39:56Z

src/coreclr/nativeaot/System.Private.CoreLib/src/System/Threading/Lock.NativeAot.cs

+            s_minSpinCount = DefaultMinSpinCount << SpinCountScaleShift;
+
+            // we can now use the slow path of the lock.
+            Volatile.Write(ref s_staticsInitializationStage, (int)StaticsInitializationStage.Usable);


The lock is now functional and did not do anything that could take locks. The rest of initialization is optional and just need to eventually happen.

ghost assigned VSadov Oct 23, 2023

dotnet-issue-labeler bot added area-NativeAOT-coreclr new-api-needs-documentation labels Oct 23, 2023

jkotas added area-System.Threading and removed area-NativeAOT-coreclr labels Oct 23, 2023

build-analysis bot mentioned this pull request Oct 23, 2023

Test failure: Wasm.Build.Tests.NativeBuildTests.MonoAOTCross_WorksWithNoTrimming #93522

Closed

kouvel closed this Oct 24, 2023

kouvel reopened this Oct 26, 2023

This was referenced Oct 27, 2023

Timeout in System.Net.Quic.Functional.Tests #86019

Closed

CI error: System.Net.Quic.QuicException: The connection timed out from inactivity #91757

Closed

VSadov force-pushed the lock2s branch from 10bb377 to 9e9f7b2 Compare October 27, 2023 04:54

build-analysis bot mentioned this pull request Oct 27, 2023

MSBuild crashing in the build #92290

Open

runfoapp bot mentioned this pull request Oct 27, 2023

system.text.regularexpressions.tests work item #72834

Open

build-analysis bot mentioned this pull request Oct 30, 2023

System.Data.OleDb.Tests timeout in net48 x86 Release leg #87783

Open

ilonatommy mentioned this pull request Oct 31, 2023

[browser][hybridGlobalization] WBT: Hybrid Globalization fails with AOT #94212

Closed

VSadov force-pushed the lock2s branch 3 times, most recently from f4c8351 to d58eade Compare November 3, 2023 04:09

VSadov force-pushed the lock2s branch from 9a2c67c to 5a5e5f1 Compare November 19, 2023 20:52

VSadov removed the new-api-needs-documentation label Nov 19, 2023

dotnet deleted a comment from azure-pipelines bot Nov 19, 2023

build-analysis bot mentioned this pull request Nov 19, 2023

DisabledRuntimeMarshalling.PInvokeAssemblyMarshallingEnabled.DelegatesFromExternalAssembly.StructWithDefaultNonBlittableFields() failing #94931

Closed

VSadov mentioned this pull request Nov 20, 2023

[NativeAOT] Fix Lock's usage of NativeRuntimeEventSource.Log to account for recursive accesses during its own class construction #94873

Merged

VSadov force-pushed the lock2s branch from 5a5e5f1 to 1c4bd55 Compare November 20, 2023 01:18

dotnet deleted a comment from azure-pipelines bot Nov 20, 2023

build-analysis bot mentioned this pull request Nov 20, 2023

Test_EventSource_EtwManifestGeneration* tests failing in CI #48798

Closed

VSadov added 5 commits November 20, 2023 12:21

Use contention detection for self-tuning

4431336

small tweaks

2cd2fd6

more fine-grained spincount computation

164898e

backoff cleanup.

eec1c81

use s_contentionCount as additional source of randomness

047a59c

VSadov force-pushed the lock2s branch from 1c4bd55 to 047a59c Compare November 23, 2023 23:07

dotnet deleted a comment from azure-pipelines bot Nov 23, 2023

Update a comment.

344ea2a

dotnet deleted a comment from azure-pipelines bot Dec 18, 2023

build-analysis bot mentioned this pull request Dec 18, 2023

Segmentation fault in System.Text.RegularExpressions.Tests #93206

Closed

ghost closed this Jan 17, 2024

github-actions bot locked and limited conversation to collaborators Feb 17, 2024

VSadov commented Apr 19, 2024

View reviewed changes

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lock that uses congestion detection for self-tuning #93879

Lock that uses congestion detection for self-tuning #93879

VSadov commented Oct 23, 2023 •

edited

Loading

dotnet-issue-labeler bot commented Oct 23, 2023

ghost commented Oct 23, 2023

ghost commented Oct 23, 2023

VSadov commented Oct 23, 2023 •

edited

Loading

VSadov commented Oct 23, 2023

VSadov commented Oct 23, 2023 •

edited

Loading

kouvel commented Oct 24, 2023

kouvel commented Oct 26, 2023

VSadov commented Oct 27, 2023

azure-pipelines bot commented Oct 27, 2023

VSadov commented Nov 3, 2023

azure-pipelines bot commented Nov 3, 2023

VSadov commented Nov 23, 2023

azure-pipelines bot commented Nov 23, 2023

VSadov commented Dec 18, 2023

azure-pipelines bot commented Dec 18, 2023

ghost commented Jan 17, 2024

VSadov Apr 19, 2024

Lock that uses congestion detection for self-tuning #93879

Lock that uses congestion detection for self-tuning #93879

Conversation

VSadov commented Oct 23, 2023 • edited Loading

dotnet-issue-labeler bot commented Oct 23, 2023

ghost commented Oct 23, 2023

ghost commented Oct 23, 2023

VSadov commented Oct 23, 2023 • edited Loading

VSadov commented Oct 23, 2023

VSadov commented Oct 23, 2023 • edited Loading

kouvel commented Oct 24, 2023

kouvel commented Oct 26, 2023

VSadov commented Oct 27, 2023

azure-pipelines bot commented Oct 27, 2023

VSadov commented Nov 3, 2023

azure-pipelines bot commented Nov 3, 2023

VSadov commented Nov 23, 2023

azure-pipelines bot commented Nov 23, 2023

VSadov commented Dec 18, 2023

azure-pipelines bot commented Dec 18, 2023

ghost commented Jan 17, 2024

VSadov Apr 19, 2024

Choose a reason for hiding this comment

VSadov commented Oct 23, 2023 •

edited

Loading

VSadov commented Oct 23, 2023 •

edited

Loading

VSadov commented Oct 23, 2023 •

edited

Loading